Unsupervised Wisdom from Medical Narratives

Exploring Medical Narratives on Older Adult Falls

Exploring unsupervised machine learning methods to extract insights from medical narratives about falls among older adults (age 65+)
Author

Ifechukwu Mbamali

Published

29-11-2023

Introduction

The analysis made use of embeddings, dimensionality reduction, clustering algorithms, network graphs and text summarization techniques to identify and understand themes in medical narratives on older adult falls.

Key findings: The use of embeddings in combination with dimensionality reduction techniques proved effective in extracting cluster themes, and DBSCAN1 outperformed k-means in cluster identification. Patients in the “Alcohol-Related Head Injuries and Falls” group tended to be younger, the “Atrial fibrillation related falls” group was generally older, and the “Syncope-Related Head Injuries” group had a higher rate of severe cases than the others. Compared with the previous year (2021), cases involving “Head Injuries from Falls”, “Syncope-Related Head Injuries” and “Rib Injuries from Falls” saw the largest increase in the average number of cases.

  • 1 Density-based clustering algorithm, introduced in Ester et al. 1996, which can identify clusters of arbitrary shape in a data set containing noise and outliers. DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise

  • Ultimately, insights gained through this analysis can help inform policies and interventions to reduce older adult falls. This analysis was prepared for a competition hosted by the Centers for Disease Control and Prevention.

    Data Overview

    The analysis made use of 2 data-sets:

    • Primary data-set
    • OpenAI embeddings data-set
    Code
    #recode the encoded variables in the dataset to human-readable values
    mapping <- fromJSON("data/variable_mapping.json")
    # Convert to data frames so we can use in joins
    mapping_tables <- list()
    for (col in names(mapping)) {
        mapping_tables[[col]] <- data.frame(
            ind=as.integer(names(mapping[[col]])),  # change to integer types
            values=unlist(mapping[[col]])
        )
    }
    Code
    # Load primary data
    pdf <- read.csv("data/primary_data.csv" )
    
    # Join and replace encoded column
    for (col in names(mapping)) {
        pdf <- pdf %>%
            left_join(mapping_tables[[col]], by=setNames("ind", col)) %>%
            mutate(!!col := values) %>%
            select(-values)
    }
    Code
    emb2c = emb2b |> select(1:51)
    Code
    #################################
    ## speed-up notebook rendering ##
    #################################
    
    emb2d = fread("data/embeddings_pca.csv")#embeddings_pca_x.csv
    Code
    emb2d  = emb2b|>
      select(cpsc_case_number)|>
      bind_cols(emb2d )
    Code
    as.datatable(formattable(pdf|>
                               head(1)
                             ), rownames = F,
    filter = 'top', 
    options = list(
      pageLength = 10, autoWidth = F,
      order = list(list(2, 'desc'))#asc
    ),
    class = 'bootstrap'
    )
    Code
    #table view of the first 51 columns of the raw embeddings file
    as.datatable(formattable(emb2c|>
                               head(5)
                             ), rownames = F,
    filter = 'top',
    options = list(
      pageLength = 10, autoWidth = F,
      order = list(list(2, 'desc'))#asc
    ),
    class = 'bootstrap'
    )

    Text cleaning and pre-processing

    Some of the general preprocessing steps include:

    • New columns: Two fields were added to the primary data-set: a severity level derived from the disposition column, and an activity category derived from the product and narrative columns.
    • Replacing medical abbreviations: Abbreviations in the narrative column were replaced with their full clinical definitions to improve readability.
    Code
    #create a column called "severity_level" that says "severe" if the number 4 or 5 is contained in the "disposition" column, and "not severe" otherwise
    pdf <- pdf |>
      mutate(severity_level = ifelse(grepl("4|5", disposition), "severe", "not severe"))
    
    #create a column called "activity" that captures the text between "-" and "(" or "-" and "," if the term "ACTIVITY" is contained in the "product_1" column, and "others" otherwise
    
    pdf <- pdf |>
      mutate(activity = ifelse(grepl("ACTIVITY", product_1),
                               sub(".*-(.*?)[(,].*", "\\1", product_1),
                               "others"))
    
    #modifies the "activity" column by replacing "others" with "fainted" if "SYNCOPAL" is contained in the "narrative" column
    pdf <- pdf |>
      mutate(activity = ifelse(activity == "others" & grepl("SYNCOPAL|DIZZY|WEAK|WEAKNESS|SYNCOPE", narrative), "fainted", activity))
    
    #modifies the "activity" column by replacing "others" with "WALKING" if "WALKING" is contained in the "narrative" column
    pdf <- pdf |>
      mutate(activity = ifelse(activity == "others" & grepl("WALKING|WALK", narrative), "WALKING", activity))
    
    #modifies the "activity" column by replacing "others" with "STANDING" if "STANDING" is contained in the "narrative" column
    pdf <- pdf |>
      mutate(activity = ifelse(activity == "others" & grepl("STANDING|STAND", narrative), "STANDING", activity))
    
    #modifies the "activity" column by replacing "others" with "SITTING" if "SITTING" is contained in the "narrative" column
    pdf <- pdf |>
      mutate(activity = ifelse(activity == "others" & grepl("SITTING|SIT", narrative), "SITTING", activity))
    
    #modifies the "activity" column by replacing "others" with "Stair Navigation" if "FLIGHT" is contained in the "narrative" column
    pdf <- pdf |>
      mutate(activity = ifelse(activity == "others" & grepl("FLIGHT|STAIRS", narrative), "Stair Navigation", activity))
    
    #modifies the "activity" column by replacing "others" with "RISING" if the narrative mentions getting up from a chair or bed
    pdf <- pdf |>
      mutate(activity = ifelse(activity == "others" & grepl("GETTING|CHAIR|BED|STOOD UP", narrative), "RISING", activity))
    
    #---
    pdf <- pdf |>
      mutate(activity = ifelse(activity == "others" & grepl("SLIPPED|SLIP", narrative), "SLIPPED", activity))
    
    #---
    pdf <- pdf |>
      mutate(activity = ifelse(activity == "others" & grepl("TRIPPED|TRIP", narrative), "TRIPPED", activity))
    
    #---
    
    pdf <- pdf |>
      mutate(activity = ifelse(activity == "others" & grepl("BENDING|BENT|BEND|PICK UP|PICKING UP", narrative), "BENDING", activity))
    
    #---
    
    pdf <- pdf |>
      mutate(activity = ifelse(activity == "others" & grepl("MECH|MECHANICAL", narrative), "MECHANICAL", activity))
    
    #---
    
    pdf <- pdf |>
      mutate(activity = ifelse(activity == "others" & grepl("LOST BALANCE", narrative), "LOST BALANCE", activity))
    
    #---
    
    pdf <- pdf |>
      mutate(activity = ifelse(grepl("BASKETBALL|BASEBALL|BALL|SPORTS|SPORT|BILLIARDS|BOWLING|SKATING|GOLF|TENNIS|MOUNTAIN CLIMBING|SKIING|SOCCER|HOCKEY|FISHING|SWIMMING|MARTIAL ARTS|LACROSSE|TUBING|HORSEBACK RIDING|SURFING|WRESTLING|BADMINTON|SHUFFLEBOARD|FENCING", activity), "SPORTS", activity))
    • Cleaning Narratives by replacing medical & other abbreviations
    Code
    # Define the medical_terms dictionary
    medical_terms <- list(
      "&" = "and",
      "***" = "",
      ">>" = "clinical diagnosis",
      "@" = "at",
      "abd" = "abdomen",
      "af" = "accidental fall",
      "afib" = "atrial fibrillation",
      "aki" = "acute kidney injury",
      "am" = "morning",
      "ams" = "altered mental status",
      "bac" = "blood alcohol content",
      "bal" = "blood alcohol level,",
      "biba" = "brought in by ambulance",
      "c/o" = "complains of",
      "chi" = "closed-head injury",
      "clsd" = "closed",
      "cpk" = "creatine phosphokinase",
      "cva" = "cerebral vascular accident",
      "dx" = "diagnosis",
      "ecf" = "extended-care facility",
      "er" = "emergency room",
      "etoh" = "ethyl alcohol",
      "eval" = "evaluation",
      "fib" = "fibrillation",
      "fd" = "fall detected",
      "fx" = "fracture",
      "fxs" = "fractures",
      "glf" = "ground level fall",
      "h/o" = "history of",
      "htn" = "hypertension",
      "hx" = "history of",
      "inj" = "injury",
      "inr" = "international normalized ratio",
      "intox" = "intoxication",
      "l" = "left",
      "lac" = "laceration",
      "loc" = "loss of consciousness",
      "lt" = "left",
      "mech" = "mechanical",
      "mult" = "multiple",
      "n.h." = "nursing home",
      "nh" = "nursing home",
      "p/w" = "presents with",
      "pm" = "afternoon",
      "pt" = "patient",
      "pta" = "prior to arrival",
      "pts" = "patient's",
      "px" = "physical examination",
      "r" = "right",
      "r/o" = "rules out",
      "rt" = "right",
      "s'd&f" = "slipped and fell",
      "s/p" = "after",
      "sah" = "subarachnoid hemorrhage",
      "sdh" = "acute subdural hematoma",
      "sts" = "sit-to-stand",
      "t'd&f" = "tripped and fell",
      "tr" = "trauma",
      "uti" = "urinary tract infection",
      "w/" = "with",
      "w/o" = "without",
      "wks" = "weeks"
    )
    
    # Define the clean_narrative function
    clean_narrative <- function(text) {
      # Convert text to lowercase
      text <- tolower(text)
    
      # Normalize "dx" markers into ". dx: "
      # (note: \W loses its special meaning inside a bracket expression,
      # so a pattern like "[\\W]*" would match literal characters)
      regex_dx <- "\\W*\\b(dx)\\b\\W*"
      text <- gsub(regex_dx, ". dx: ", text)
    
      # Define regex pattern for age and sex
      regex_age_sex <- "(\\d+)\\s*?(yof|yf|yo\\s*female|yo\\s*f|yom|ym|yo\\s*male|yo\\s*m)"
    
      # Replace age/sex phrases such as "45 yof" with "patient"
      text <- gsub(regex_age_sex, "patient", text)
    
      # Translate medical terms
      for (term in names(medical_terms)) {
        if (term %in% c("@", ">>", "&", "***")) {
          pattern <- paste0("(", gsub("[*]", "[*]", term), ")")
          text <- gsub(pattern, paste0(" ", medical_terms[[term]], " "), text)
        } else {
          pattern <- paste0("\\b(", gsub("[*]", "[*]", term), ")\\b")
          text <- gsub(pattern, medical_terms[[term]], text)
        }
      }
    
      # Capitalize sentences
      text <- gsub("(^|\\.[[:space:]]+)([a-z])", "\\1\\U\\2", text, perl = TRUE)
    
      return(text)
    }
    
    # Test the function
    input_text <- "The pt is a 45 yof who c/o abdominal pain. Dx: uti. She fell and has a left hip fx."
    cleaned_text <- clean_narrative(input_text)
    cat(cleaned_text)
    The patient is a patient who complains of abdominal pain. Diagnosis: urinary tract infection. She fell and has a left hip fracture.
    Code
    ##############################
    ## speed notebook rendering ##
    ##############################
    
    ##run once or alternatively load "data/clean_narrative_data.csv"
    
    ## applying cleaning function to data
    
    # pdf$narrative_orig = pdf$narrative
    # pdf_0 <- pdf %>%
    #   mutate(narrative = map_chr(narrative, clean_narrative))
    Code
    ##############################
    ## speed notebook rendering ##
    ##############################
    
    ## speed up render by saving the cleaned data to csv and loading it back
    #fwrite(pdf_0, "data/clean_narrative_data.csv")
    Code
    ##############################
    ## speed notebook rendering ##
    ##############################
    pdf_0 = fread("data/clean_narrative_data.csv")

    General Analysis of Narratives

    In this section, text processing and analysis tasks were performed on the cleaned narrative column. The code removes certain specified words2 and stop words, tokenizes the text into bigrams, counts the frequency of each bigram, and calculates the percentage of occurrence for each, performing various cleaning and filtering steps along the way.

  • 2 Recurring words that do not provide any insightful information, e.g. “yom”, “yof”.
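The pipeline described above can be sketched in base R on hypothetical toy narratives (illustrative only; the analysis itself uses tidytext's `unnest_tokens`, below):

```r
# Toy narratives and word lists (hypothetical, for illustration)
narratives <- c(
  "patient fell and sustained head injury",
  "patient slipped on floor head injury noted"
)
stops   <- c("and", "on", "the", "a")   # stop words
fillers <- c("patient", "noted")        # uninformative recurring words

bigram_counts <- function(texts, drop) {
  pairs <- unlist(lapply(strsplit(tolower(texts), "\\s+"), function(w) {
    w <- w[!w %in% drop]             # remove stop/filler words
    if (length(w) < 2) return(character(0))
    paste(head(w, -1), tail(w, -1))  # adjacent word pairs (bigrams)
  }))
  tab <- sort(table(pairs), decreasing = TRUE)
  data.frame(word = names(tab), n = as.integer(tab),
             percentage = 100 * as.integer(tab) / sum(tab))
}

res <- bigram_counts(narratives, c(stops, fillers))
res[1, ]  # the most frequent pair here is "head injury"
```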

  • Code
    #https://paldhous.github.io/NICAR/2019/r-text-analysis.html
    pdf5 = pdf_0 %>%
      #filter(activity=="Stair Navigation")|>
      mutate(narrative =
               gsub("\\byof\\b|\\byom\\b|\\bpt\\b|\\bdx\\b|\\bdiagnosis\\b", "", narrative, ignore.case = TRUE)) %>%
        unnest_tokens(word, narrative, token = "ngrams", n = 2)%>% #tokenize into bigrams
      anti_join(stop_words)%>% #remove stop words
    
      count(word, sort = TRUE)
    Joining with `by = join_by(word)`
    Code
    # remove stop words
    pdf6 <- pdf5 %>%
      separate(word, into = c("first","second"), sep = " ", remove = FALSE) %>%
      anti_join(stop_words, by = c("first" = "word")) %>%
      anti_join(stop_words, by = c("second" = "word")) %>%
      filter(str_detect(first, "^[a-zA-Z]{3,}$") &
              str_detect(second, "^[a-zA-Z]{3,}$"))%>%
      mutate(percentage = n / sum(n) * 100) 

    “Head Injury” is the most frequently recurring word pair in the narrative data

    Code
    as.datatable(formattable(pdf6|>
                               filter(percentage>0.25),digits=2, list(
      #n = percent,
      n = color_tile("transparent", "#a1caf1")
    
    )), rownames = F,
    filter = 'top', 
    options = list(
      pageLength = 10, autoWidth = F#,
      #order = list(list(4, 'desc'))#asc
    ),
    class = 'bootstrap'
    )

    Text Network Analysis

    Text network analysis can be used to represent the narratives as a network graph: the words are the nodes and their co-occurrences are the edges. With the narratives encoded as a network, graph theory algorithms can be used to detect the most influential keywords, identify the main topics and the relations between them, and gain insight into the structure of the discourse. This approach focuses on the relations between the words while retaining contextual information and the narrative. Unlike bag-of-words, LDA-based, or Word2Vec models, which may lose information about word sequence, a text network can be built in a way that retains the narrative and therefore provides more accurate information about the text and its topical structure.
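As a toy illustration of this encoding (hypothetical word pairs; the analysis proper uses igraph and visNetwork below), bigrams become directed edges, and even a simple degree count hints at the most influential words:

```r
# Hypothetical bigram edge list: each word pair becomes a from->to edge
edges <- data.frame(
  from = c("head", "hip", "closed", "nursing"),
  to   = c("injury", "fracture", "head", "home")
)

# Degree = number of edges touching each word (node); high-degree words
# are candidates for the most influential keywords
deg <- sort(table(c(edges$from, edges$to)), decreasing = TRUE)
deg  # "head" has the highest degree (2)
```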

    Code
    pdf_net = pdf6|>
      #filter(n>1000)|>
      filter(percentage>0.25)|>
      select(first, second)|>
      rename(from = first, to = second)
    
    network = graph_from_data_frame(d = pdf_net)
    
    network2 = toVisNetworkData(network)
    n = data.frame(network2$nodes, font.size = 30)
    e = data.frame(network2$edges)
    visNetwork(n,e)|> 
      visIgraphLayout(layout = "layout_with_kk",#layout_on_grid, layout.star, layout_on_sphere, layout_with_kk
                      physics = F)|>#layout_in_circle
        visNodes(size = 30) |>
      #visEdges(arrows = "from")|>
      visOptions(highlightNearest = list(enabled = T, hover = T, degree =1),
                 nodesIdSelection = T)
    ## pagerank
    pagerank = network%>%
      as_tbl_graph() %>%
      mutate(pagerank = centrality_pagerank())|>
      as_tibble()%>%
      arrange(desc(pagerank))
    
    as.datatable(formattable(pagerank, digits=2,list(
      pagerank = percent,
      pagerank = color_tile("transparent", "#a1caf1")
    
    )), rownames = F,
    filter = 'top', 
    options = list(
      pageLength = 10, autoWidth = F#,
      #order = list(list(4, 'desc'))#asc
    ),
    class = 'bootstrap'
    )

    Text Summarization

    This section builds on the previous one by leveraging TextRank, an extractive text summarization method based on the PageRank algorithm. In this analysis, sentences are modelled as the vertices and shared words as the connecting edges, so sentences whose words appear in many other sentences are considered more important.
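A minimal sketch of the underlying idea, with toy sentences and a plain PageRank power iteration standing in for `textrank_sentences()` (illustrative assumptions: edge weights are counts of shared non-stop words, damping factor 0.85):

```r
# Toy sentences (hypothetical)
sentences <- c("patient fell at nursing home",
               "head injury after a fall at nursing home",
               "discharged in stable condition")
words <- lapply(strsplit(tolower(sentences), "\\s+"), unique)

# Edge weights: number of words shared between each pair of sentences
n <- length(sentences)
W <- matrix(0, n, n)
for (i in seq_len(n)) for (j in seq_len(n))
  if (i != j) W[i, j] <- length(intersect(words[[i]], words[[j]]))

# Column-normalize, then run the PageRank power iteration
P <- sweep(W, 2, pmax(colSums(W), 1), "/")
r <- rep(1 / n, n); d <- 0.85
for (k in 1:50) r <- d * P %*% r + (1 - d) / n

# Sentences 1 and 2 share words and outrank the isolated sentence 3
which.max(r)
```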

    Code
    #https://www.emilhvitfeldt.com/post/2018-03-15-tidy-text-summarization/
    
    pdf5b = pdf_0 %>%
      filter(age>80)|>
      mutate(narrative =
               gsub("\\byof\\b|\\byom\\b|\\byf\\b|\\bym\\b|\\bpt\\b|\\bdx\\b|\\bdiagnosis\\b", "", narrative, ignore.case = TRUE))|>
      head(50)
    Code
    article_sentences <-pdf5b %>%
      unnest_tokens(sentence, narrative, token = "sentences") %>%
      mutate(sentence_id = row_number()) %>%
      select(sentence_id, sentence)
    
    article_words <- article_sentences %>%
      unnest_tokens(word, sentence)%>%
      anti_join(stop_words, by = "word")
    
    #Running TextRank
    article_summary <- textrank_sentences(data = article_sentences, 
                                          terminology = article_words)
    • An overview of the top 5 sentences based on the first 50 narratives for adults older than 80 years
    Code
    #extracting the top 5
    article_summary[["sentences"]] %>%
      arrange(desc(textrank)) %>% 
      slice(1:5) %>%
      pull(sentence)
    [1] "94 fell to the floor at the nursing home onto back of head sustained a subdural hematoma"                     
    [2] "88 fell to the floor at the nursing home and sustained a laceration to face"                                  
    [3] "93 was walking at the nursing home and tripped and fell to the floor onto head sustained a closed head injury"
    [4] "92 fell to carpeted floor at the nursing home and sustained a hip fracture"                                   
    [5] "95 fell to the floor at home and sustained a hip fracture"                                                    

    Key takeaway(s)

    • Fall accidents for adults older than 80 years of age tend to occur at the nursing home

    Modelling: Identifying themes based on narrative embedding

    In this section, two clustering algorithms, k-means and DBSCAN, were tested for their efficacy in identifying theme clusters.

    K-means clustering is the most commonly used unsupervised machine learning algorithm for partitioning a given data set into a set of k groups (i.e. k clusters), where k is pre-specified by the analyst. The basic idea behind k-means clustering is to define clusters so that the total intra-cluster variation (known as the total within-cluster variation) is minimized.
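As a minimal, self-contained illustration (toy data with base R's `stats::kmeans`, not the tidyclust workflow used below), `tot.withinss` is the total within-cluster variation that the algorithm minimizes for the pre-specified k:

```r
set.seed(123)
# Two well-separated toy groups of 20 points each in 2-D
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 5), ncol = 2))

# Partition into k = 2 clusters; nstart restarts guard against bad initializations
fit <- kmeans(x, centers = 2, nstart = 10)

table(fit$cluster)  # cluster sizes
fit$tot.withinss    # total within-cluster variation (minimized quantity)
```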

    DBSCAN is a density-based clustering algorithm, which can be used to identify clusters of any shape in a data set containing noise and outliers. The key idea is that for each point of a cluster, the neighborhood of a given radius has to contain at least a minimum number of points. The goal is to identify dense regions, which can be measured by the number of objects close to a given point.
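DBSCAN's neighborhood test can be sketched in base R on hypothetical toy points: a point is a "core" point when at least minPts points (counting itself) lie within radius eps of it, while isolated points fail the test and are treated as noise. The clustering itself is run below with `dbscan::dbscan()`.

```r
# Three nearby points and one distant outlier (hypothetical data)
pts    <- rbind(c(0, 0), c(0, 0.1), c(0.1, 0), c(5, 5))
eps    <- 0.5  # neighborhood radius
minPts <- 3    # minimum neighbors (including the point itself)

# Pairwise distances; a point is "core" if its eps-neighborhood is dense enough
dmat <- as.matrix(dist(pts))
is_core <- rowSums(dmat <= eps) >= minPts
is_core  # the first three points are core; the outlier is not
```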

    Processing step: Applying UMAP reduction step to the PCA processed data, to present data in 2-dimensional space.
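UMAP itself is not available in base R; as a rough stand-in illustrating the idea of projecting high-dimensional embeddings down to two dimensions before clustering, `prcomp` can be used on toy data (for illustration only; the actual step uses `step_umap`):

```r
set.seed(123)
# Toy "embeddings": 100 observations in 5 dimensions (hypothetical)
emb_toy <- matrix(rnorm(500), ncol = 5)

# Project onto the first two principal components: a 2-D representation
proj <- prcomp(emb_toy, center = TRUE, scale. = TRUE)$x[, 1:2]
dim(proj)  # 100 rows, 2 columns
```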

    Code
    set.seed(123)
    #create recipe
    recipe_object_2 = recipe(~., data = emb2d)|>
      step_umap(all_numeric_predictors(), -cpsc_case_number)
     
    #extract table
    kmeans_tbl = recipe_object_2|>
      prep()|>
      juice()
    Code
    k_tbl2 = kmeans_tbl|>
      slice_sample(n=1000)
    
    #silhouette method
    fviz_nbclust(k_tbl2,
                 FUNcluster = kmeans,
                 method = c("silhouette"),
                 diss = NULL,
                 k.max = 10,
                 nboot = 100)

    Code
    #gap stats
    fviz_nbclust(k_tbl2,
                 FUNcluster = kmeans,
                 method = c("gap_stat"),
                 diss = NULL,
                 k.max = 10,
                 nboot = 10)

    Code
    #elbow method
    fviz_nbclust(k_tbl2,
                 FUNcluster = kmeans,
                 method = c("wss"),
                 diss = NULL,
                 k.max = 10,
                 nboot = 10)

    Code
    #Specifying clustering models, arbitrarily set the number of clusters to 4
    kmeans_spec_best_emb <- k_means(num_clusters = 4) %>% 
      set_engine("ClusterR")
    
    #create workflow
    kmeans_wf_best_emb <- workflow()|>
      add_recipe(recipe_object_2)|>
      add_model(kmeans_spec_best_emb)
    
    #fit model
    kmeans_best_fit_mdl_emb <- kmeans_wf_best_emb|>
      fit(data = emb2d) #emb2d
    
    kmeans_best_fit_mdl_emb
    ══ Workflow [trained] ══════════════════════════════════════════════════════════
    Preprocessor: Recipe
    Model: k_means()
    
    ── Preprocessor ────────────────────────────────────────────────────────────────
    1 Recipe Step
    
    • step_umap()
    
    ── Model ───────────────────────────────────────────────────────────────────────
    KMeans Cluster
     Call: ClusterR::KMeans_rcpp(data = data, clusters = clusters, num_init = num_init,      max_iters = max_iters, initializer = initializer, fuzzy = fuzzy,      verbose = verbose, CENTROIDS = CENTROIDS, tol = tol, tol_optimal_init = tol_optimal_init,      seed = seed) 
     Data cols: 3 
     Centroids: 4 
     BSS/SS: 0.9873194 
     SS: 1.602937e+19 = 2.032624e+17 (WSS) + 1.58261e+19 (BSS)
    Code
    #predict cluster
    pdf_cluster_emb =kmeans_best_fit_mdl_emb %>% #kmeans_fit_emb|>
      augment(emb2d) #emb2c
    
    pdf_pca_emb = recipe_object_2 |>
      prep()|>
      juice()
    
    #merge dataframe
    pdf_cluster_emb_merge = pdf_cluster_emb|>
      select(cpsc_case_number, .pred_cluster)|>
      left_join( pdf_0, by = "cpsc_case_number")|>
      bind_cols(pdf_pca_emb)
    New names:
    • `cpsc_case_number` -> `cpsc_case_number...1`
    • `cpsc_case_number` -> `cpsc_case_number...27`

    Visualizing Clusters

    Code
    emb_plot = pdf_cluster_emb_merge %>%
     #filter(!(other_race %in% "")) %>%
     ggplot() +
     aes(x = UMAP1, y = UMAP2, colour = .pred_cluster, 
         #size = age, 
         text = activity) +
     geom_point(shape = "circle") +
     scale_color_hue(direction = 1) +
     #geom_mark_ellipse(aes(color = .pred_cluster), expand = unit(0.5,"mm"))+
      theme_minimal()
    #emb_plot
    ggplotly(emb_plot,tooltip = "text")

    Identifying the optimal “eps” parameter

    Code
    #https://stats.stackexchange.com/questions/88872/a-routine-to-choose-eps-and-minpts-for-dbscan
    pdf_pca_emb_db = pdf_pca_emb|> select(-cpsc_case_number)
    dbscan::kNNdistplot(pdf_pca_emb_db, k =  450)
    abline(h = 0.4, lty = 2)

    Code
    set.seed(123)
    pdf_dbscan = dbscan(pdf_pca_emb_db, eps = 0.4, minPts = 550)
    Code
    # Plot DBSCAN results
    #hullplot(pdf_pca_emb_db, pdf_dbscan$cluster)

    Visualize Cluster

    Code
    pdf_cluster_emb_merge_db = pdf_cluster_emb_merge|>
      cbind(pdf_dbscan$cluster)|>
      rename( db_cluster = "pdf_dbscan$cluster")|>
      mutate(db_cluster = as_factor(db_cluster))
    
    emb_plot_db = pdf_cluster_emb_merge_db %>%
     #filter(!(other_race %in% "")) %>%
     ggplot() +
     aes(x = UMAP1, y = UMAP2, colour = db_cluster, 
         #size = age, 
         text = activity) +
     geom_point(shape = "circle",size = 0.5) +
     scale_color_hue(direction = 1) +
     #geom_mark_ellipse(aes(color = .pred_cluster), expand = unit(0.5,"mm"))+
      theme_minimal()
    #emb_plot
    ggplotly(emb_plot_db,tooltip = "text")

    Key takeaway:

    • Effectiveness of Embeddings: Using embeddings proved effective in extracting themes.
    • DBSCAN vs. K-means: Density-based clustering (DBSCAN) demonstrated greater efficacy in identifying clusters compared to the k-means approach.
    • Theme Identification: Density-based clustering revealed the presence of 9 major themes, with cluster 0 being categorized as outlier/general themes

    Understanding themes

    In this section, theme clusters from the density-based algorithm are explored3 in relation to the “activities” associated with falls.

  • 3 Note: Click on the legend to isolate a cluster theme or themes from others

  • Code
    pdf_cluster_emb_rdr_1bb <- pdf_cluster_emb_merge_db %>%
      #filter(.pred_cluster == "Cluster_5") %>%
      #group_by(.pred_cluster)|>
      group_by(db_cluster)|>
      count(activity, sort = TRUE) %>%
      mutate(percentage = n / sum(n) ) %>%
      filter(percentage > 0.02) %>%
      select(activity, percentage) |>
      pivot_wider(names_from = activity, values_from = percentage)|>
      mutate_all(~replace_na(.x, 0))
    
    pdf_cluster_emb_rdr_1c = pdf_cluster_emb_rdr_1bb|>
      as_tibble()|>
      #mutate_each(funs(rescale), -.pred_cluster)
      mutate_each(funs(rescale), -db_cluster)
    
    
    
    rd = pdf_cluster_emb_rdr_1c|>
      ggradar(
        #group.colours = palette_light()|>unname(),
        #fill = T,
        #fill.alpha = 0.1,
        plot.title = "Radar Chart",
        group.line.width = 1,
        group.point.size = 1,
        font.radar = "ariel" ,
        axis.label.size = 3,
        grid.label.size = 5
      )+ theme(
        text = element_text(family = "ariel"),
        plot.title = element_text(size = 12)
      ) #+ facet_wrap(~.pred_cluster,ncol = 3)
    ggplotly(rd)

    Exploring clusters

    • Visualizing Insights on cluster 6
    Code
    selection = "6"
    pdf_cluster_emb_1 = pdf_cluster_emb_merge_db %>% #pdf_cluster_emb_merge
      #filter(.pred_cluster == "Cluster_1")|>
      filter(db_cluster == selection)|>
      mutate(narrative = 
               gsub("\\byof\\b|\\byom\\b|\\bPT\\b|\\bdx\\b|\\bDiagnosis\\b|\\bhead\\b|\\bhip\\b|\\bleg\\b|\\bscalp\\b|\\bskin\\b|\\barm\\b|\\bknee\\b|\\belbow\\b|\\bshoulder\\b|\\bneck\\b|\\bchest\\b|\\bforehead\\b|\\bwrist\\b|\\brib\\b|\\bhit\\b|\\bhitting\\b|\\bclosed\\b", "", narrative,ignore.case = TRUE)) %>%
        unnest_tokens(word, narrative, token = "ngrams", n = 2)%>% #split each word as a row
      anti_join(stop_words)%>% #remove stop words
    
      count(word, sort = TRUE)%>%
      separate(word, into = c("first","second"), sep = " ", remove = FALSE) %>%
      anti_join(stop_words, by = c("first" = "word")) %>%
      anti_join(stop_words, by = c("second" = "word")) %>%
      filter(str_detect(first, "^[a-zA-Z]{3,}$") &
              str_detect(second, "^[a-zA-Z]{3,}$"))%>%
      mutate(percentage = n / sum(n) * 100)
    Joining with `by = join_by(word)`
    Code
    pdf_cluster_emb_net_1 = pdf_cluster_emb_1|>
      #filter(n>20)|>
      filter(percentage>0.25)|>
      select(first, second,n)|>
      rename(from = first, to = second)
    
    network_emb_c1 = graph_from_data_frame(d = pdf_cluster_emb_net_1)
    network_emb_c1_2 = toVisNetworkData(network_emb_c1)
    n = data.frame(network_emb_c1_2$nodes, font.size = 30)
    
    
    e = data.frame(network_emb_c1_2$edges)
    visNetwork(n,e)|> #, height = "600px",width = "600px"
      visIgraphLayout(layout = "layout_with_kk",#"layout.star",#"layout_with_kk",
                      physics = F)|>#layout_in_circle
        visNodes(size = 30) |>
      #visEdges(arrows = "from")|>
      visOptions(highlightNearest = list(enabled = T, hover = T, degree =1),
                 nodesIdSelection = T)
    Code
    ## pagerank
    pagerank_1 = network_emb_c1%>%
      as_tbl_graph() %>%
      mutate(pagerank = centrality_pagerank())|>
      as_tibble()%>%
      arrange(desc(pagerank))
    
    as.datatable(formattable(pagerank_1, digits=2,list(
      pagerank = percent,
      pagerank = color_tile("transparent", "#a1caf1")
    
    )), rownames = F,
    filter = 'top', 
    options = list(
      pageLength = 10, autoWidth = F#,
      #order = list(list(4, 'desc'))#asc
    ),
    class = 'bootstrap'
    )

    Cluster Text Summarization

    Code
    pdf_sum_1 = pdf_cluster_emb_merge_db %>%
      #filter(.pred_cluster == "Cluster_1")|>
      filter(db_cluster == selection)|>
      mutate(narrative = 
               gsub("\\byof\\b|\\byom\\b|\\bPT\\b|\\bdx\\b|\\bDiagnosis\\b|\\dx\\b|\\dx|\\yof|\\yom|\\yf|\\ym|\\:", "", narrative,ignore.case = TRUE))|>
     slice_sample(n=100)#, by=c(sex,product_1,diagnosis)
    Code
    article_sentences_1 <-pdf_sum_1 %>%
      unnest_tokens(sentence, narrative, token = "sentences") %>%
      mutate(sentence_id = row_number()) %>%
      select(sentence_id, sentence)
    
    article_words_1 <- article_sentences_1 %>%
      unnest_tokens(word, sentence)%>%
      anti_join(stop_words, by = "word")
    
    #Running TextRank
    article_summary_1 <- textrank_sentences(data = article_sentences_1, 
                                          terminology = article_words_1)
    Code
    #extracting the top 3 
    article_summary_1[["sentences"]] %>%
      arrange(desc(textrank)) %>% 
      slice(1:10) %>%
      pull(sentence)
     [1] "65 was intoxicated blood alcohol content 151 and fell onto her head on the floor .   closed head injury; acute alcohol intoxication"                                                               
     [2] "65  fell and struck head on wooden floor  while intoxicated with no blood alcohol level,   .    head injury, alcohol intoxication"                                                                 
     [3] "76 was intoxicated blood alcohol content 208 and fell to the bathroom floor onto head .   closed head injury; acute alcohol intoxication"                                                          
     [4] "74 was intoxicated and fell down a flight of stairs blood alcohol content of 198 .   closed head injury acute ethyl alcohol intoxication"                                                          
     [5] "74  ground level fall after drinking, fell onto table .   scalp laceration, ethyl alcohol abuse no blood alcohol content"                                                                          
     [6] "75 drinking ethyl alcohol at a bar, passed out and fell to floor hitting face blood alcohol content not done.   .   facial fractures, closed head injury, acute ethyl alcohol intoxication"        
     [7] "80 per report patient presents after drinking ethyl alcohol tonight and fell from bed and not acting like himself blood alcohol level, 302 .   fall ethyl alcohol intoxication forehead laceration"
     [8] "77 was at home intoxicated blood alcohol content 205 and fell to the floor onto her head .   subarachnoid hemorrhage; facial laceration; alcohol intoxication"                                     
     [9] "65  drinking alchol and fell down stairs hitting forehead  with no blood alcohol level, done  .    laceration forehead, alcohol intoxication"                                                      
    [10] "84 fell backwards and struck head on floor+ethyl alcohol,blood alcohol content>292--.  laceration scalp"                                                                                           

    Summary overview of all cluster themes

    Cluster | Theme | Associated Activities | Obstacle | Injury | Top 3 Keywords (excluding the term “Fall”)
    ------- | ----- | --------------------- | -------- | ------ | ------------------------------------------
    0 | General Elderly Falls and Injuries | Lost Balance | Ladders, others not specified | Others | left, admit, contusion
    1 | Head Injuries from Falls | Standing, Rising | Bed or bed-frames | Laceration | injury, laceration, contusion
    2 | Falls Resulting in Shoulder Injuries | Tripped, Exercise, Sports | Exercise | Dislocation, Avulsion, Strain & Sprain | fracture, left, humerus
    3 | Hip Injuries from Falls | Rising, Tripped | Footwear | Fracture, Strain & Sprain | fracture, left, femur
    4 | Syncope-Related Head Injuries | Fainted | Toilets | Laceration | syncope, laceration, striking
    5 | Rib Injuries from Falls | Standing | Bath-tubs or showers | Fracture | left, fracture, ribs
    6 | Alcohol-Related Head Injuries and Falls | Stair Navigation, others | Stairs or steps | Poisoning, Laceration | alcohol, blood, intoxication
    7 | Buttocks Contusions from Falls | Rising, Sitting, Slipped | Bed or bed-frames | Contusions | contusions, buttocks, lower
    8 | Atrial fibrillation related falls | Sitting, Standing | Tables, rugs & carpets, ceilings & walls | Hematoma | encounter, laceration, initial
    9 | Floor Falls and Associated Injuries | Walking, Slipped | Floors, balconies | Contusions | falling, floor, dizzy

    Further Exploration

In this section, building on the understanding of the cluster themes, the themes are further explored in relation to other variables such as age, severity4 level, and sex.

  • 4 see appendix for severity level definitions

Code
#Cluster themes by age distribution
plot_2 = ggplot(pdf_cluster_emb_merge_db) +
 aes(x = db_cluster, y = age) +
 geom_boxplot(fill = "#AEC8DF") +
 labs(x = "Cluster themes", y = "Age", title = "Cluster themes by age distribution") +
 theme_minimal()
ggplotly(plot_2)
    Code
#Cluster themes by Severity levels
plot_3 = ggplot(pdf_cluster_emb_merge_db) +
 aes(x = db_cluster, fill = severity_level) +
 geom_bar() +
 scale_fill_brewer(palette = "Blues", direction = 1) +
 labs(x = "Cluster themes", y = "Number of Narratives", title = "Cluster themes by Severity levels", fill = "Severity Level") +
 theme_minimal()
ggplotly(plot_3)
    Code
    plot_4 = ggplot(pdf_cluster_emb_merge_db) +
     aes(x = db_cluster, fill = sex) +
     geom_bar() +
     scale_fill_brewer(palette = "Blues", 
     direction = 1) +
     labs(x = "Cluster Themes", y = "Number of Narratives", title = "Cluster themes by sex") +
     theme_minimal()
    
    ggplotly(plot_4)
    Code
    plot_5 = pdf_cluster_emb_merge_db %>%
     filter(!(location %in% "UNK")) %>%
     ggplot() +
     aes(x = db_cluster, fill = location) +
     geom_bar() +
     scale_fill_brewer(palette = "Blues", 
     direction = 1) +
     labs(x = "Cluster Theme", y = "Number of Narratives", title = "Cluster themes by incident location") +
     theme_minimal()
    
    ggplotly(plot_5)
    Code
    plot_6 = pdf_cluster_emb_merge_db|>
      group_by(treatment_date,db_cluster)|>
      summarise(
        cases = n()
      )
    
    
    plot_6a = ggplot(plot_6) +
     aes(x = treatment_date, y = cases, colour = db_cluster) +
     geom_line() +
     scale_color_hue(direction = 1) +
     labs(y = "Cases", title = "Trend of Cluster Themes", color = "Cluster themes") +
     theme_minimal()
    
    
     ggplotly(plot_6a)

    Summary table view of the average number of cases for each cluster theme across the different years

    Code
plot_7 = plot_6 |>
  mutate(year = year(treatment_date)) |>
  group_by(year, db_cluster) |>
  summarise(
    avg_cases = mean(cases)
  ) |>
  pivot_wider(names_from = year, values_from = avg_cases)
plot_7$change_from_2021 = (plot_7$"2022" / plot_7$"2021") - 1
    
as.datatable(formattable(plot_7, digits = 2, list(
  change_from_2021 = percent,
  "2019" = color_tile("transparent", "#a1caf1"),
  "2020" = color_tile("transparent", "#a1caf1"),
  "2021" = color_tile("transparent", "#a1caf1"),
  "2022" = color_tile("transparent", "#a1caf1")
)), rownames = F,
filter = 'top',
options = list(
  pageLength = 10, autoWidth = F
),
class = 'bootstrap'
)

    Conclusion

    • Combining embeddings with dimensionality reduction techniques has proven to be highly effective in the extraction of cluster themes.
    • DBSCAN outperforms k-means in cluster identification.
    • Patients in the “Alcohol-Related Head Injuries and Falls” group tend to be younger, while the “Atrial fibrillation related falls” group was generally older
    • The “Syncope-Related Head Injuries” group had a higher rate of severe cases compared to other groups.
• Compared with the previous year (2021), cases involving “Head Injuries from Falls”, “Syncope-Related Head Injuries” and “Rib Injuries from Falls” saw the most significant increases in the average number of cases.
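As a minimal illustration of the DBSCAN approach referenced above, the `dbscan` package can be run directly on a reduced embedding matrix. The toy 2-D points below stand in for the dimensionality-reduced narrative embeddings, and the `eps`/`minPts` values are illustrative, not the ones tuned in this analysis:

```r
library(dbscan)

# Toy 2-D points standing in for dimensionality-reduced embeddings:
# two dense blobs plus one lone outlier.
set.seed(42)
pts <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.2), ncol = 2),
  c(10, 10)  # noise point
)

# eps = neighborhood radius; minPts = minimum points to form a dense region
db <- dbscan(pts, eps = 0.5, minPts = 5)

# Label 0 is DBSCAN's noise label; unlike k-means, outliers are not
# forced into a cluster, which is one reason it suits messy narrative data.
table(db$cluster)
```

Note that DBSCAN needs no pre-specified cluster count, whereas k-means requires choosing k up front.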

    Appendix

• Disposition classification from which the severity levels were derived
| Disposition Code | Category   |
|------------------|------------|
| 1                | Not Severe |
| 2                | Not Severe |
| 4                | Severe     |
| 5                | Severe     |
| 6                | Not Severe |
| 8                | Severe     |
| 9                | Not Severe |

    In this classification:

    • “Severe” includes disposition codes 4 (Treated and admitted for hospitalization), 5 (Held for observation), and 8 (Fatality, including DOA and deaths in the ED or after admission).

    • “Not Severe” includes disposition codes 1 (Treated and released, or examined and released without treatment, or transfers for treatment to another department of the same facility without admission), 2 (Treated and transferred to another hospital), 6 (Left without being seen, Left against medical advice, Left without treatment, Eloped), and 9 (Not recorded).
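The mapping above can be sketched as a simple recode. This assumes the numeric codes live in a column named `disposition` (a hypothetical name; the actual column in the analysis data may differ):

```r
library(dplyr)

# Map disposition codes to the two severity levels defined above;
# `disposition` is an assumed column name holding the numeric codes.
derive_severity <- function(df) {
  df %>%
    mutate(severity_level = case_when(
      disposition %in% c(4, 5, 8) ~ "Severe",        # admitted, observed, fatality
      disposition %in% c(1, 2, 6, 9) ~ "Not Severe", # released, transferred, left, not recorded
      TRUE ~ NA_character_                           # any unexpected code
    ))
}
```

Keeping the recode in one function makes the severity definition easy to audit and to change if a different grouping of codes is preferred.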

    Code
    sessionInfo()
    R version 4.3.1 (2023-06-16 ucrt)
    Platform: x86_64-w64-mingw32/x64 (64-bit)
    Running under: Windows 11 x64 (build 22621)
    
    Matrix products: default
    
    
    locale:
    [1] LC_COLLATE=English_United States.utf8 
    [2] LC_CTYPE=English_United States.utf8   
    [3] LC_MONETARY=English_United States.utf8
    [4] LC_NUMERIC=C                          
    [5] LC_TIME=English_United States.utf8    
    
    time zone: America/Los_Angeles
    tzcode source: internal
    
    attached base packages:
    [1] stats     graphics  grDevices datasets  utils     methods   base     
    
    other attached packages:
     [1] cluster_2.1.4      dbscan_1.1-11      factoextra_1.0.7   textrank_0.3.1    
     [5] fmsb_0.7.5         janitor_2.2.0      embed_1.1.2        tidygraph_1.2.3   
     [9] formattable_0.2.1  reactable_0.4.4    DT_0.30            ggradar_0.2       
    [13] plotly_4.10.3      data.table_1.14.8  arrow_13.0.0.1     hunspell_3.0.3    
    [17] igraph_1.5.1       visNetwork_2.1.2   tidytext_0.4.1     tidyclust_0.2.0   
    [21] jsonlite_1.8.7     LSAfun_0.6.3       rgl_1.2.1          lsa_0.73.3        
    [25] SnowballC_0.7.1    vroom_1.6.4        lubridate_1.9.3    forcats_1.0.0     
    [29] stringr_1.5.1      readr_2.1.4        tidyverse_2.0.0    yardstick_1.2.0   
    [33] workflowsets_1.0.1 workflows_1.1.3    tune_1.1.2         tidyr_1.3.0       
    [37] tibble_3.2.1       rsample_1.2.0      recipes_1.0.8      purrr_1.0.2       
    [41] parsnip_1.1.1      modeldata_1.2.0    infer_1.0.5        dplyr_1.1.2       
    [45] dials_1.2.0        scales_1.2.1       broom_1.0.5        tidymodels_1.1.1  
    [49] ggplot2_3.4.4     
    
    loaded via a namespace (and not attached):
      [1] RColorBrewer_1.1-3  rstudioapi_0.15.0   magrittr_2.0.3     
      [4] farver_2.1.1        rmarkdown_2.25      vctrs_0.6.4        
      [7] base64enc_0.1-3     rstatix_0.7.2       htmltools_0.5.7    
     [10] janeaustenr_1.0.0   sass_0.4.7          parallelly_1.36.0  
     [13] bslib_0.5.1         htmlwidgets_1.6.2   tokenizers_0.3.0   
     [16] cachem_1.0.8        whisker_0.4.1       lifecycle_1.0.4    
     [19] iterators_1.0.14    pkgconfig_2.0.3     Matrix_1.6-1       
     [22] R6_2.5.1            fastmap_1.1.1       snakecase_0.11.1   
     [25] future_1.33.0       digest_0.6.33       colorspace_2.1-0   
     [28] furrr_0.3.1         irlba_2.3.5.1       crosstalk_1.2.0    
     [31] ggpubr_0.6.0        labeling_0.4.3      fansi_1.0.5        
     [34] tfruns_1.5.1        timechange_0.2.0    abind_1.4-5        
     [37] httr_1.4.7          compiler_4.3.1      bit64_4.0.5        
     [40] withr_2.5.2         backports_1.4.1     carData_3.0-5      
     [43] tensorflow_2.14.0   ggsignif_0.6.4      MASS_7.3-60        
     [46] lava_1.7.3          tools_4.3.1         future.apply_1.11.0
     [49] nnet_7.3-19         glue_1.6.2          grid_4.3.1         
     [52] modelenv_0.1.1      keras_2.13.0        generics_0.1.3     
     [55] gtable_0.3.4        tzdb_0.4.0          class_7.3-22       
     [58] hms_1.1.3           car_3.1-2           utf8_1.2.4         
     [61] RcppAnnoy_0.0.21    ggrepel_0.9.4       foreach_1.5.2      
     [64] pillar_1.9.0        splines_4.3.1       lhs_1.1.6          
     [67] lattice_0.21-8      gmp_0.7-2           renv_1.0.3         
     [70] survival_3.5-5      bit_4.0.5           tidyselect_1.2.0   
     [73] knitr_1.45          xfun_0.41           hardhat_1.3.0      
     [76] timeDate_4022.108   stringi_1.8.1       DiceDesign_1.9     
     [79] lazyeval_0.2.2      yaml_2.3.7          evaluate_0.23      
     [82] codetools_0.2-19    cli_3.6.1           uwot_0.1.16        
     [85] rpart_4.1.19        reticulate_1.34.0   jquerylib_0.1.4    
     [88] munsell_0.5.0       Rcpp_1.0.11         globals_0.16.2     
     [91] zeallot_0.1.0       png_0.1-8           parallel_4.3.1     
     [94] ellipsis_0.3.2      gower_1.0.1         assertthat_0.2.1   
     [97] ClusterR_1.3.1      GPfit_1.0-8         listenv_0.9.0      
    [100] viridisLite_0.4.2   ipred_0.9-14        prodlim_2023.08.28 
    [103] crayon_1.5.2        rlang_1.1.2